Data-driven Science

Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract or extrapolate knowledge and insights from noisy, structured and unstructured data, and apply knowledge from data across a broad range of application domains. Data science is related to data mining, machine learning, big data, computational statistics and analytics. Data science is a "concept to unify statistics, data analysis, informatics, and their related methods" in order to "understand and analyse actual phenomena" with data. It uses techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, information science, and domain knowledge. However, data science is different from computer science and information science.

Turing Award winner Jim Gray imagined data science as a "fourth paradigm" of science (empirical, theoretical, computational, and now data-driven) and asserted that "everything about science is changing because of the impact of information technology" and the data deluge. A data scientist is someone who creates programming code and combines it with statistical knowledge to create insights from data.


Foundations

Data science is an interdisciplinary field focused on extracting knowledge from typically large data sets and applying the knowledge and insights from that data to solve problems in a wide range of application domains. The field encompasses preparing data for analysis, formulating data science problems, analyzing data, developing data-driven solutions, and presenting findings to inform high-level decisions in a broad range of application domains. As such, it incorporates skills from computer science, statistics, information science, mathematics, data visualization, information visualization, data sonification, data integration, graphic design, complex systems, communication and business. Statistician Nathan Yau, drawing on Ben Fry, also links data science to human–computer interaction: users should be able to intuitively control and explore data. In 2015, the American Statistical Association identified database management, statistics and machine learning, and distributed and parallel systems as the three emerging foundational professional communities.


Relationship to statistics

Many statisticians, including Nate Silver, have argued that data science is not a new field, but rather another name for statistics. Others argue that data science is distinct from statistics because it focuses on problems and techniques unique to digital data. Vasant Dhar writes that statistics emphasizes quantitative data and description. In contrast, data science deals with quantitative and qualitative data (e.g., from images, text, sensors, transactions or customer information) and emphasizes prediction and action. Andrew Gelman of Columbia University has described statistics as a nonessential part of data science. Stanford professor David Donoho writes that data science is not distinguished from statistics by the size of datasets or use of computing, and that many graduate programs misleadingly advertise their analytics and statistics training as the essence of a data-science program. He describes data science as an applied field growing out of traditional statistics.


Etymology


Early usage

In 1962, John Tukey described a field he called "data analysis", which resembles modern data science. In 1985, in a lecture given to the Chinese Academy of Sciences in Beijing, C. F. Jeff Wu used the term "data science" for the first time as an alternative name for statistics. Later, attendees at a 1992 statistics symposium at the University of Montpellier II acknowledged the emergence of a new discipline focused on data of various origins and forms, combining established concepts and principles of statistics and data analysis with computing. The term "data science" has been traced back to 1974, when Peter Naur proposed it as an alternative name for computer science. In 1996, the International Federation of Classification Societies became the first conference to specifically feature data science as a topic. However, the definition was still in flux. After the 1985 lecture at the Chinese Academy of Sciences in Beijing, in 1997 C. F. Jeff Wu again suggested that statistics should be renamed data science. He reasoned that a new name would help statistics shed inaccurate stereotypes, such as being synonymous with accounting or limited to describing data. In 1998, Hayashi Chikio argued for data science as a new, interdisciplinary concept, with three aspects: data design, collection, and analysis. During the 1990s, popular terms for the process of finding patterns in datasets (which were increasingly large) included "knowledge discovery" and "data mining".


Modern usage

In 2012, technologists Thomas H. Davenport and DJ Patil declared "Data Scientist: The Sexiest Job of the 21st Century", a catch-phrase that was picked up even by major-city newspapers like the New York Times and the Boston Globe. A decade later, they reaffirmed it, stating "the job is more in demand than ever with employers". The modern conception of data science as an independent discipline is sometimes attributed to William S. Cleveland. In a 2001 paper, he advocated an expansion of statistics beyond theory into technical areas; because this would significantly change the field, it warranted a new name. "Data science" became more widely used in the next few years: in 2002, the Committee on Data for Science and Technology launched Data Science Journal. In 2003, Columbia University launched The Journal of Data Science. In 2014, the American Statistical Association's Section on Statistical Learning and Data Mining changed its name to the Section on Statistical Learning and Data Science, reflecting the ascendant popularity of data science. The professional title of "data scientist" has been attributed to DJ Patil and Jeff Hammerbacher in 2008. Though the title was used by the National Science Board in its 2005 report "Long-Lived Digital Data Collections: Enabling Research and Education in the 21st Century", there it referred broadly to any key role in managing a digital data collection. There is still no consensus on the definition of data science, and it is considered by some to be a buzzword. Big data is a related marketing term. Data scientists are responsible for breaking down big data into usable information and creating software and algorithms that help companies and organizations determine optimal operations.


See also

* Open Data Science Conference
* Scientific Data
* Women in Data

